Features:
- MPI I/O calls to operate on file
- MPI calls to coordinate file access
- Can perform collective I/O
Source Code Files:
src/H5FDmpio.c and src/H5FDmpio.h
API calls:
- H5Pset_fapl_mpio
- Dups the MPI Comm & Info objects (passed as parameters) and stores them in the FAPL for future use
- H5Pget_fapl_mpio
- Dups the MPI Comm & Info objects (from the FAPL) and passes them back to the application
VFD callbacks implemented:
- open (H5FD_mpio_open)
- Dups the MPI Comm & Info objects (from the FAPL) for use in file & coordination operations
- MPI_File_open() w/file's Comm & Info objects
- Gets rank & size from file's Comm
- If (rank == 0)
- If MPI_File_get_size() is available, use it to get the file's size; otherwise stat() the file to get its size
- [MPI_File_get_size() detected at configure time]
- MPI_Bcast() file size to all ranks
- if (size > 0 && truncate flag set)
- MPI_File_set_size() to 0 size
- MPI_Barrier()
- close (H5FD_mpio_close)
- MPI_File_close()
- Free file's Comm & Info objects
- query (H5FD_mpio_query)
- Sets flags to indicate it's OK to aggregate allocation of file metadata and small raw data
- No other flags set
- get_eoa (H5FD_mpio_get_eoa)
- retrieve (local) EOA value
- set_eoa (H5FD_mpio_set_eoa)
- set (local) EOA value
- get_eof (H5FD_mpio_get_eof)
- get (local) EOF value
- get_handle (H5FD_mpio_get_handle)
- Directly copies MPI_File value back into application buffer
- read (H5FD_mpio_read)
- set buffer's MPI type (<buf_type>) to MPI_BYTE
- memset() MPI_Status to 0's
- convert address of I/O to MPI_Offset value (<mpi_offset>)
- if ( I/O type is "raw data")
- Get <xfer mode> property from DXPL
- if ( <xfer mode> is Collective)
- Set <use view> flag
- get MPI type for buffer (<buf_type>) from DXPL
- get MPI type for file (<file_type>) from DXPL
- MPI_File_set_view() with <mpi_offset>, <file_type> & <buf_type>
- Set <mpi_offset> to 0
- if ( <use view> set) [i.e. we're doing collective I/O, and <mpi_offset> = 0]
- get <collective opt> property from DXPL [chunked datasets only]
- if ( <collective opt> flag set)
- MPI_File_read_at_all() with <mpi_offset>, <file_type> & <buf_type>
- else
- MPI_File_read_at() with <mpi_offset>, <file_type> & <buf_type>
- MPI_File_set_view() with offset = 0 and MPI type = MPI_BYTE
- else
- MPI_File_read_at() with <mpi_offset> and <buf_type> ( == MPI_BYTE)
- <compute number of bytes actually read>
- MPI_Get_elements() on the read's MPI_Status, with the MPI type set to MPI_BYTE, to get the number of bytes read (<bytes_read>)
- Get the <buf_type>'s size with MPI_Type_size()
- Compute the <I/O size>
- [This works (only) because the "basic elements" we use for all our MPI derived datatypes are MPI_BYTE. We should be using the <buf_type> in MPI_Get_elements(), but aren't because it caused the LANL "qsc" machine to dump core]
- if ( (<I/O size> - <bytes_read>) > 0)
- memset() to 0 the portion of the application buffer that wasn't read
- write (H5FD_mpio_write)
- <same as for read callback, with MPI write calls instead>
- Except it doesn't 0-fill the application buffer (of course)
- Also, it resets the (local) EOF value to "undefined"
- flush (H5FD_mpio_flush)
- if ( <not closing file> )
- MPI_File_sync()
- truncate (H5FD_mpio_truncate)
- if ( <EOA> > <last EOA> )
- if ( <MPI_File_set_size() works correctly> ) [set at configure time]
- Convert <EOA> value to MPI Offset
- MPI_File_set_size() w/ <EOA> (as MPI Offset)
- else
- MPI_Barrier()
- if (rank == 0)
- Read the byte at <EOA - 1>
- Write that byte back to <EOA - 1>
- MPI_Barrier()
- set <last EOA> to <EOA>